Abstract. We give a greedy learning algorithm for reconstructing an evolutionary tree based
on a certain harmonic average on triplets of terminal taxa. After the pairwise distances between
terminal taxa are estimated from sequence data, the algorithm runs in $O(n^2)$ time using $O(n)$ work
space, where $n$ is the number of terminal taxa. These time and space complexities are optimal in the
sense that the size of an input distance matrix is $n^2$ and the size of an output tree is $n$. Moreover,
in the Jukes-Cantor model of evolution, the algorithm recovers the correct tree topology with high
probability using sample sequences of length polynomial in (1) $n$, (2) the logarithm of the error
probability, and (3) the inverses of two small parameters.
1. Introduction. Algorithms for reconstructing evolutionary trees are useful
tools in biology [16], |2^] . These algorithms usually compare aligned character sequences
for the terminal taxa in question to infer their evolutionary relationships. In the 
past, such characters were often categorical variables of morphological features; newer 
studies have taken advantage of available biomolecular sequences. This paper focuses 
on datasets of the latter type. 

We present a new learning algorithm, called Fast Harmonic Greedy Triplets (Fast-HGT),
using a greedy strategy based on a certain harmonic average on triplets of
terminal taxa. After the pairwise distances between terminal taxa are estimated from
their observed sequences, Fast-HGT runs in $O(n^2)$ time using $O(n)$ work space, where
$n$ is the number of terminal taxa. These time and space complexities are optimal in
the sense that $n^2$ is the size of an input distance matrix and $n$ is the size of an output
tree. An earlier variant of Fast-HGT takes $O(n^5)$ time ||. In the Jukes-Cantor model
of sequence evolution generalized for an arbitrary alphabet [22], Fast-HGT is proven
to recover the correct topology with high probability while requiring sample sequences
of length $\ell$ polynomial in (1) $n$, (2) the logarithm of the error probability, and (3) the
inverses of two small parameters (Theorem 3.8). In subsequent work |J, Fast-HGT
and its variants are shown to have similar theoretical performance in more general
Markov models of evolution.

Among the related work, there are four other algorithms which have essentially
the same guarantee on the length $\ell$ of sample sequences. These are the Dyadic Closure
Method (DCM) [10] and the Witness-Antiwitness Method (WAM) of Erdős, Steel,
Székely, and Warnow, the algorithm of Cryan, Goldberg, and Goldberg (CGG) j|, and
the DCM-Buneman algorithm of Huson, Nettles, and Warnow j^]. Not all of these
results analyzed the space complexity. In terms of time complexity, DCM-Buneman
is not a polynomial-time algorithm. CGG runs in polynomial time, whose degree has
not been explicitly determined but which appears to be higher than $n^2$. DCM takes
$O(n^5 \log n)$ time to assemble $O(n^4)$ quartets using $O(n^4)$ space. The two versions of
WAM take $O(n^6 \log n)$ and $O(n^4 \log n \log \ell)$ time, respectively. In the uniform and
Yule-Harding models of randomly generating trees, with high probability, these two
latter running times are reduced to $O(n^3\,\mathrm{polylog}\,n)$ and $O(n^2\,\mathrm{polylog}\,n)$, respectively.
Under these two tree distributions, Erdős et al. [10] further showed that with high
probability, the required sample size of DCM is polylogarithmic in $n$; this bound also
applies to WAM, CGG, DCM-Buneman, and Fast-HGT.

Among the algorithms with no known comparable guarantees on $\ell$, the Neighbor
Joining Method of Saitou and Nei [22] runs in $O(n^3)$ time and reconstructs many
trees highly accurately in practice, although the best known upper bound on its
required sample size is exponential in $n$ H . Maximum likelihood methods p| , [15] are
not known to achieve the optimal required sample size as such methods are usually
expected to p0| ; moreover, all their known implementations take exponential time to
find local optima, and none can find provably global optima. Parsimony methods aim
to compute a tree that minimizes the number of mutations leading to the observed
sequences \L3^ ; in general, such optimization is NP-hard ||. Some algorithms strive
to find, among all possible trees, an evolutionary tree that best fits the observed distances
according to some metric [[jj ; such optimization is NP-hard for the $L_1$ and $L_2$ norms
§ and for the $L_\infty$ norm §.

A common goal of the above algorithms is to construct a tree with the same
topology as that of the true tree. In contrast, the work on PAC-learning the true tree
in the $j$-state General Markov Model |2^| aims to construct a tree which is close to the
true tree in terms of the leaf distribution in the sense of Kearns et al. [19] but which
need not be the same as the true tree. Farach and Kannan (l^] gave an $O(n^2 \ell)$-time
algorithm (FK) for the symmetric case of the 2-state model provided that all pairs
of leaves have a sufficiently high probability of being the same. Ambainis, Desper,
Farach, and Kannan [Q] gave a nearly tight lower bound on $\ell$ for achieving a given
variational distance between the true tree and the reconstructed tree. As for obtaining
the true tree, the best known upper bound on $\ell$ required by FK is exponential in $n$.
CGG also improves upon FK to PAC-learn in the general 2-state model without
the symmetry and leaf similarity constraints.

The remainder of the paper is organized as follows. Section 2 reviews the generalized
Jukes-Cantor model of sequence evolution and discusses distance-based probabilistic
techniques. Section 3 gives Fast-HGT. Section 4 concludes the paper with
some directions for further research.



2. Model and techniques. Section 2.1 defines the model of evolution used in
the paper. Section 2.2 defines our problem of recovering evolutionary trees from biological
sequences. Sections 2.3 through 2.5 develop basic techniques for the problem.


2.1. A model of sequence evolution. This paper employs the generalized
Jukes-Cantor model ||^] of sequence evolution defined as follows. Let $m \ge 2$ and
$n \ge 3$ be two integers. Let $A = \{a_1, \ldots, a_m\}$ be a finite alphabet. An evolutionary
tree $T$ for $A$ is a rooted binary tree of $n$ leaves with an edge mutation probability $p_e$ for
each tree edge $e$. The edge mutation probabilities are bounded away from $0$ and $1 - \frac{1}{m}$;
i.e., there exist $f$ and $g$ such that for every edge $e$ of $T$, $0 < f \le p_e \le g < 1 - \frac{1}{m}$. Given
a sequence $s_1 \cdots s_\ell \in A^\ell$ associated with the root of $T$, a set of $n$ mutated sequences in
$A^\ell$ is generated by $\ell$ random labelings of the tree at the nodes. These $\ell$ labelings are
mutually independent. The labelings at the $j$-th leaf give the $j$-th mutated sequence
$s_1^{(j)} \cdots s_\ell^{(j)}$, where the $i$-th labeling of the tree gives the $i$-th symbols $s_i^{(1)}, \ldots, s_i^{(n)}$.
The $i$-th labeling is carried out from the root towards the leaves along the edges. The
root is labeled by $s_i$. On edge $e$, the child's label is the same as the parent's with
probability $1 - p_e$ or is different with probability $\frac{p_e}{m-1}$ for each different symbol. Such
mutations of symbols along the edges are mutually independent.
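
The per-edge mutation rule above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the helper names `mutate_symbol` and `mutate_sequence`, the DNA alphabet, and the value $p_e = 0.1$ are illustrative choices and do not come from the paper.

```python
import random

def mutate_symbol(parent, alphabet, p_e, rng):
    """Keep `parent` with probability 1 - p_e; otherwise choose one of the
    other m - 1 symbols uniformly, i.e. each with probability p_e / (m - 1)."""
    if rng.random() >= p_e:
        return parent
    others = [a for a in alphabet if a != parent]
    return rng.choice(others)

def mutate_sequence(parent_seq, alphabet, p_e, rng):
    """Apply the edge mutation independently at every site of the sequence."""
    return [mutate_symbol(s, alphabet, p_e, rng) for s in parent_seq]

# Hypothetical usage: one edge with p_e = 0.1 over a 4-letter alphabet.
rng = random.Random(0)
alphabet = "ACGT"
root = [rng.choice(alphabet) for _ in range(1000)]
child = mutate_sequence(root, alphabet, 0.1, rng)
```

Generating a full set of leaf sequences would repeat `mutate_sequence` along every edge from the root toward the leaves.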

2.2. Problem formulation. The topology $\Psi(T)$ of $T$ is the unrooted tree obtained
from $T$ by omitting the edge mutation probability and by replacing the two
edges $e_1$ and $e_2$ between the root and its children with a single edge $e_0$. Note that
the leaves of $\Psi(T)$ are labeled with the same sequences as in $T$, but $\Psi(T)$ need not be
labeled otherwise. The weighted topology $\Psi_w(T)$ of $T$ is $\Psi(T)$ where each edge $e \ne e_0$
of $\Psi(T)$ is further weighted by its edge mutation probability $p_e$ in $T$ and for technical
reasons, the edge $e_0$ is weighted by $1 - (1 - p_{e_1})(1 - p_{e_2})$.

For technical convenience, the weight of each edge $XY$ in $\Psi_w(T)$ is often replaced
by a certain edge length, such as $\Delta_{XY}$ in Equation (2.5), from which the weight of
$XY$ can be efficiently determined.

The weighted evolutionary topology problem is that of taking $n$ mutated sequences
as input and recovering $\Psi_w(T)$ with high accuracy and high probability. Fast-HGT
is a learning algorithm for this problem.

Remark. The special treatment for $e_1$ and $e_2$ is due to the fact that the root
sequence may be entirely arbitrary and thus, in general, no algorithm can place the
root accurately. This is consistent with the fact that the root sequence is not directly
observable in practice, and locating the root requires considerations beyond those of
general modeling [22]. If the root sequence is also given as input, Fast-HGT can be
modified to locate the root and the weights of $e_1$ and $e_2$ in a straightforward manner.

2.3. Probabilistic closeness. Fast-HGT is based on a notion of probabilistic
closeness between nodes. For the $i$-th random labeling of $T$, we identify each node
of $T$ with the random variable $X_i$ that gives the labeling at the node. Note that since
$s_1 \cdots s_\ell$ may be arbitrary, the random variables $X_i$ for different $i$ are not necessarily
identically distributed. For brevity, we often omit the index $i$ of $X_i$ in a statement if
the statement is independent of $i$.

For nodes $X$ and $Y \in T$, let $p_{XY} = \Pr\{X \ne Y\}$. The closeness of $X$ and $Y$ is

(2.1)   $\sigma_{XY} = \Pr\{X = Y\} - \frac{1}{m-1}\Pr\{X \ne Y\} = 1 - \alpha p_{XY}$, where $\alpha = \frac{m}{m-1}$.

Lemma 2.1 (folklore). If node $Y$ is on the path between two nodes $X$ and $Z$ in
$T$, then $\sigma_{XZ} = \sigma_{XY}\sigma_{YZ}$.
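
The multiplicativity in Lemma 2.1 can be checked numerically. The sketch below composes two path segments by conditioning on the middle node; the function names `closeness` and `compose` and the sample probabilities are assumptions made for illustration only.

```python
def closeness(p, m):
    """Closeness 1 - alpha * p of Equation (2.1), with alpha = m / (m - 1)."""
    alpha = m / (m - 1)
    return 1 - alpha * p

def compose(p_xy, p_yz, m):
    """Pr{X != Z} when Y lies between X and Z and the two segments mutate
    independently: exactly one segment changes the symbol, or both change it
    and the second change does not restore the original symbol."""
    return (p_xy * (1 - p_yz)
            + (1 - p_xy) * p_yz
            + p_xy * p_yz * (m - 2) / (m - 1))

# Hypothetical values: a 4-letter alphabet and two mutation probabilities.
m = 4
p_xz = compose(0.1, 0.2, m)
lhs = closeness(p_xz, m)
rhs = closeness(0.1, m) * closeness(0.2, m)
```

Here `lhs` and `rhs` agree up to floating-point rounding, which is the statement of the lemma for this instance.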

If $X$ and $Y$ are leaves, their closeness is estimated from sample sequences as

(2.2)   $\hat{\sigma}_{XY} = \frac{1}{\ell}\sum_{i=1}^{\ell} I_{X_i Y_i}$,

where $X_1, \ldots, X_\ell$ and $Y_1, \ldots, Y_\ell$ are the symbols at positions $1, \ldots, \ell$ of the observed
sample sequences for the two leaves, and

$I_{xy} = -\frac{1}{m-1}$ if $x \ne y$; $I_{xy} = 1$ if $x = y$.
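
A minimal sketch of the estimator in Equation (2.2), assuming two aligned sequences of equal length; the function name `estimate_closeness` is a label chosen here, not one used by the paper.

```python
def estimate_closeness(x_seq, y_seq, m):
    """Average the indicator of Equation (2.2) over all aligned positions:
    +1 for a match and -1/(m - 1) for a mismatch."""
    if len(x_seq) != len(y_seq) or len(x_seq) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatch = -1.0 / (m - 1)
    total = sum(1.0 if a == b else mismatch for a, b in zip(x_seq, y_seq))
    return total / len(x_seq)

# Three matches and one mismatch over a 4-letter alphabet: (3 - 1/3) / 4.
example = estimate_closeness("ACGT", "ACGA", 4)
```

With a single mismatch in four positions the estimate is $(3 - \frac{1}{3})/4 = \frac{2}{3}$; identical sequences give exactly 1.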



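The next lemma is useful for analyzing the estimation given by Equation (2.2).
Lemma 2.2. For $\epsilon > 0$,

(2.3)   $\Pr\{\hat{\sigma}_{XY}/\sigma_{XY} \le 1 - \epsilon\} \le \exp\left(-\frac{2}{\alpha^2}\,\ell\,\sigma_{XY}^2\,\epsilon^2\right)$;

(2.4)   $\Pr\{\hat{\sigma}_{XY}/\sigma_{XY} \ge 1 + \epsilon\} \le \exp\left(-\frac{2}{\alpha^2}\,\ell\,\sigma_{XY}^2\,\epsilon^2\right)$.

Proof. By Equation (2.2),

$\Pr\{\hat{\sigma}_{XY}/\sigma_{XY} \le 1 - \epsilon\} = \Pr\left\{\sum_{i=1}^{\ell}(I_{X_iY_i} - \sigma_{XY}) \le -\ell\sigma_{XY}\epsilon\right\}$;

$\Pr\{\hat{\sigma}_{XY}/\sigma_{XY} \ge 1 + \epsilon\} = \Pr\left\{\sum_{i=1}^{\ell}(I_{X_iY_i} - \sigma_{XY}) \ge \ell\sigma_{XY}\epsilon\right\}$.

Since $-\frac{1}{m-1} \le I_{X_iY_i} \le 1$ and $E[I_{X_iY_i} - \sigma_{XY}] = 0$, we use Hoeffding's inequality [17] on
sums of independent bounded random variables to have Equations (2.3) and (2.4). □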
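
Solving the bound of Lemma 2.2 for $\ell$ gives a sufficient sequence length for one pair of leaves: $\exp(-\frac{2}{\alpha^2}\ell\sigma_{XY}^2\epsilon^2) \le \delta$ holds once $\ell \ge \alpha^2\ln(1/\delta)/(2\sigma_{XY}^2\epsilon^2)$. The sketch below evaluates this; the function name `hoeffding_length` and the numeric inputs are illustrative assumptions, and the result covers one tail for a single pair only, not the sample size of Fast-HGT as a whole.

```python
import math

def hoeffding_length(sigma, eps, delta, m):
    """Smallest integer l with exp(-(2 / alpha^2) * l * sigma^2 * eps^2) <= delta,
    where alpha = m / (m - 1) as in Equation (2.1)."""
    alpha = m / (m - 1)
    return math.ceil(alpha ** 2 * math.log(1 / delta) / (2 * sigma ** 2 * eps ** 2))

def tail_bound(length, sigma, eps, m):
    """Right-hand side of Equations (2.3) and (2.4) for a given length."""
    alpha = m / (m - 1)
    return math.exp(-2 * length * sigma ** 2 * eps ** 2 / alpha ** 2)

# Hypothetical inputs: closeness 0.5, relative error 0.1, failure probability 0.01.
needed = hoeffding_length(0.5, 0.1, 0.01, 4)
```

The quadratic dependence on $1/\sigma_{XY}$ is what makes distant leaf pairs expensive to estimate.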
2.4. Distance and harmonic mean. The distance of nodes $X$ and $Y \in T$ is

(2.5)   $\Delta_{XY} = -\ln \sigma_{XY}$.

For an edge $XY$ in $T$, $\Delta_{XY}$ is called the edge length of $XY$.

Fast-HGT uses Statement 2 of the next corollary to locate internal nodes of $T$.
Corollary 2.3. Let $X$, $Y$, and $Z$ be nodes in $T$.

1. If $X \ne Y$, then $\Delta_{XY} = \Delta_{YX} > 0$. Also, $\Delta_{XX} = 0$.

2. If $Y$ is on the path between $X$ and $Z$ in $T$, then $\Delta_{XZ} = \Delta_{XY} + \Delta_{YZ}$.

3. For any $\sigma$ with $\sigma_{XY} < \sigma < 1$, there is a node $P$ on the path between $X$
and $Y$ in $T$ such that $\sigma(1 - \alpha g)^{1/2} \le \sigma_{XP} \le \sigma(1 - \alpha g)^{-1/2}$. Furthermore, if
$\sigma_{XY}(1 - \alpha g)^{-1/2} < \sigma < (1 - \alpha g)^{1/2}$, then $P$ is distinct from $X$ and $Y$.

Proof. Statements 1 and 2 follow from Equation (2.1) and Lemma 2.1. Statement 3
becomes straightforward when restated in terms of distance as follows. For any
$\Delta$ with $\Delta_{XY} > \Delta > 0$, there is a node $P$ on the path between $X$ and $Y$ in $T$ such
that $\Delta + \frac{-\ln(1 - \alpha g)}{2} \ge \Delta_{XP} \ge \Delta - \frac{-\ln(1 - \alpha g)}{2}$. Furthermore, if
$\Delta_{XY} - \frac{-\ln(1 - \alpha g)}{2} > \Delta > \frac{-\ln(1 - \alpha g)}{2}$, then $P$ is distinct from $X$ and $Y$. □

If $X$ and $Y$ are leaves, their distance is estimated from sample sequences as

(2.6)   $\hat{\Delta}_{XY} = -\ln \hat{\sigma}_{XY}$ if $\hat{\sigma}_{XY} > 0$; $\hat{\Delta}_{XY} = \infty$ otherwise.




A triplet $XYZ$ consists of three distinct leaves $X$, $Y$, and $Z$ of $T$. There is an
internal node $P$ in $T$ at which the pairwise paths between the leaves in $XYZ$ intersect;
see Figure 2.1. $P$ is the center of $XYZ$, and $XYZ$ defines $P$. Note that a star is
formed by the edges on the paths between $P$ and the three leaves in $XYZ$.

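By Corollary 2.3(2), the distance between $P$ and a leaf in $XYZ$, say, $X$, can be
obtained as $\Delta_{XP} = \frac{\Delta_{XY} + \Delta_{XZ} - \Delta_{YZ}}{2}$, which is estimated by

(2.7)   $\hat{\Delta}_{XP} = \frac{\hat{\Delta}_{XY} + \hat{\Delta}_{XZ} - \hat{\Delta}_{YZ}}{2}$.

The closeness of $XYZ$ is $\sigma_{XYZ} = \frac{3}{\frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{XZ}} + \frac{1}{\sigma_{YZ}}}$, which is estimated by
$\hat{\sigma}_{XYZ} = \frac{3}{\frac{1}{\hat{\sigma}_{XY}} + \frac{1}{\hat{\sigma}_{XZ}} + \frac{1}{\hat{\sigma}_{YZ}}}$. $XYZ$ is called positive if $\hat{\sigma}_{XY}$, $\hat{\sigma}_{XZ}$, and $\hat{\sigma}_{YZ}$ are all positive.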
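
Equations (2.6) and (2.7) and the harmonic triplet closeness translate directly into code. This is a sketch: the function names are chosen here, and returning 0.0 for a non-positive triplet is a convention of the sketch rather than something the text prescribes.

```python
import math

def estimate_distance(sigma_hat):
    """Equation (2.6): -ln of the estimated closeness, or infinity if it is not positive."""
    return -math.log(sigma_hat) if sigma_hat > 0 else math.inf

def center_distance(d_xy, d_xz, d_yz):
    """Equation (2.7): distance from leaf X to the center P of triplet XYZ."""
    return (d_xy + d_xz - d_yz) / 2

def triplet_closeness(s_xy, s_xz, s_yz):
    """Harmonic mean of the three pairwise closenesses.
    Sketch convention: a non-positive triplet gets closeness 0.0."""
    if min(s_xy, s_xz, s_yz) <= 0:
        return 0.0
    return 3 / (1 / s_xy + 1 / s_xz + 1 / s_yz)

# Hypothetical star with closenesses 0.9, 0.8, 0.5 from the center P to X, Y, Z,
# so the pairwise closenesses are products along the paths (Lemma 2.1).
s_xy, s_xz, s_yz = 0.9 * 0.8, 0.9 * 0.5, 0.8 * 0.5
d_xp = center_distance(estimate_distance(s_xy),
                       estimate_distance(s_xz),
                       estimate_distance(s_yz))
```

With exact pairwise closenesses the recovered `d_xp` equals $-\ln 0.9$, the true distance from $X$ to the center. The harmonic mean is dominated by the smallest pairwise closeness, so a triplet scores highly only when all three leaves are close to one another.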

The next corollary relates $\sigma_{XYZ}$ and the pairwise closenesses of $X$, $Y$, and $Z$.
Corollary 2.4. If $\sigma_{XP} \le \sigma_{YP} \le \sigma_{ZP}$, then $\sigma_{XY} \le \sigma_{XZ} \le \sigma_{YZ}$,
$\sigma_{XZ} \ge \frac{2}{3}\sigma_{XYZ}$, and $\sigma_{YP}^2 \ge \frac{1}{3}\sigma_{XYZ}$.

Proof. This corollary follows from Lemma 2.1 and simple algebra. □
The next lemma relates $\sigma_{XYZ}$ to the probability of overestimating the distance
between $P$ and a leaf in $XYZ$ using Equation (2.7).
Lemma 2.5. For $0 < \epsilon < 1$,



A 



±xp 



*XP 



> 



-ln(l-e) 



<3exp(-^a 



XYZ^ 



Proof. See §A.l. □ 



2.5. Basis of a greedy strategy. Let $d_{XY}$ denote the number of edges in the
path between two leaves $X$ and $Y$ in $T$. By Lemma 2.1, $\sigma_{XY}$ can be as small as
$(1 - \alpha g)^{d_{XY}}$. Thus, the larger $d_{XY}$ is, the more difficult it is to estimate $\sigma_{XY}$ and
$\Delta_{XY}$. This intuition leads to a natural greedy strategy outlined below that favors
leaf pairs with small $d_{XY}$ and large $\sigma_{XY}$.

The g-depth of a node in a rooted tree $T'$ is the smallest number of edges in a
path from the node to a leaf. Let $e$ be an edge between nodes $u_1$ and $u_2$. Let $T'_1$ and
$T'_2$ be the subtrees of $T'$ obtained by cutting $e$ which contain $u_1$ and $u_2$, respectively.
The g-depth of $e$ in $T'$ is the larger of the g-depth of $u_1$ in $T'_1$ and that of $u_2$ in $T'_2$.
The g-depth of a rooted tree is the largest possible g-depth of an edge in the tree.
(The prefix g emphasizes that this usage of depth is nonstandard in graph theory.)

Let $d$ be the g-depth of $T$. Variants of the next lemma have proven very useful
and insightful; see, e.g., [9]-[11].

Lemma 2.6.

1. $d \le 1 + \lfloor \log_2(n - 1) \rfloor$.

2. Every internal node $P$ of $T$ except the root has a defining triplet $XYZ$ such
that $d_{XP}$, $d_{YP}$, and $d_{ZP}$ are all at most $d + 1$ and thus, $\sigma_{XYZ} \ge (1 - \alpha g)^{2(d+1)}$.
Every leaf of $T$ is in such a triplet.

Proof. The proof is straightforward. Note that the more unbalanced $T$ is, the
smaller its g-depth is. □
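
The two bounds of Lemma 2.6 combine into a worst-case lower bound on the closeness of the guaranteed triplets, which can be evaluated as follows. This is a sketch: the function names and the sample parameters are illustrative assumptions.

```python
def g_depth_bound(n):
    """Upper bound 1 + floor(log2(n - 1)) on the g-depth (Lemma 2.6, part 1).
    (n - 1).bit_length() - 1 equals floor(log2(n - 1)) for n >= 2."""
    return 1 + ((n - 1).bit_length() - 1)

def guaranteed_triplet_closeness(n, g, m):
    """Lower bound (1 - alpha * g)^(2 (d + 1)) from Lemma 2.6, part 2,
    with d replaced by its upper bound from part 1."""
    alpha = m / (m - 1)
    d = g_depth_bound(n)
    return (1 - alpha * g) ** (2 * (d + 1))

# Hypothetical parameters: 9 taxa, g = 0.1, 4-letter alphabet.
bound_depth = g_depth_bound(9)
bound_closeness = guaranteed_triplet_closeness(9, 0.1, 4)
```

Because the depth bound is logarithmic in $n$, this closeness bound decays only polynomially in $n$.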

In $T$, the star formed by a defining triplet of an internal node contains the three
edges incident to the internal node. Thus, $\Psi(T)$ can be reconstructed from triplets
described in Lemma 2.6(2) or those with similarly large closenesses. This observation
motivates the following definitions. Let



3V2 (V2-1 
~ [ V2 + 1 



(1 - ag) , CT sm = 



0~md 



(Tig 



G 
Remark. The choice of $\sigma_{lg}$ is obtained by solving Equations (3.2), (3.3), and (3.4).

A triplet $XYZ$ is large if $\sigma_{XYZ} \ge \sigma_{lg}$; it is small if $\sigma_{XYZ} \le \sigma_{sm}$. Note that by
Lemma 2.6(2), each nonroot internal node of $T$ has at least one large defining triplet.

Lemma 2.7. The first inequality below holds for all large triplets XYZ , and the 
second for all small triplets. 



(2.8) 
(2.9) 



Pr{ G XY z < CTmd } < exp ^- i^-i) io£j ■ 

( (V2-1) 2 A 

Pr{ a xyz > o- md } < exp ~ 36q 2 ia l J ■ 



Proof. See §A.2. □ 



A nonroot internal node of $T$ may have more than one large defining triplet.
Consequently, since distance estimates contain errors, we may obtain an erroneous
estimate of $\Psi(T)$ by reconstructing the same internal node more than once from its
different large defining triplets. To address this issue, Fast-HGT adopts a threshold

$0 < \Delta_{\min} \le \frac{-\ln(1 - \alpha f)}{2}$

based on the fact that the distance between two distinct nodes
is at least $-\ln(1 - \alpha f)$; also let c = - CTfz^ ■
Fast-HGT considers the center $P$ of a
triplet $XYZ$ and the center $Q$ of another triplet $XUV$ to be separate if and only if

(2.10)   $|\hat{\Delta}_{XP} - \hat{\Delta}_{XQ}| \ge \Delta_{\min}$,

where $\hat{\Delta}_{XP} = (\hat{\Delta}_{XY} + \hat{\Delta}_{XZ} - \hat{\Delta}_{YZ})/2$ and $\hat{\Delta}_{XQ} = (\hat{\Delta}_{XU} + \hat{\Delta}_{XV} - \hat{\Delta}_{UV})/2$. Notice
that two triplet centers can be compared in this manner only if the triplets share at
least one leaf. The next lemma shows that a large triplet's center is estimated within
a small error with high probability.

Lemma 2.8. Let P be the center of a triplet XYZ . If XYZ is not small, then 



(2.11) 



Pr 



±XP 



IIP 



> 



< 7 exp ( 



Proof. See §A.3. □
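
The separation rule of Equation (2.10) compares two centers through their estimated distances from a shared leaf $X$. A minimal sketch, assuming the six pairwise distance estimates are already available; the function name `centers_separate` and the numbers are illustrative.

```python
def centers_separate(d_xy, d_xz, d_yz, d_xu, d_xv, d_uv, delta_min):
    """Equation (2.10): the centers of triplets XYZ and XUV are treated as
    distinct nodes iff their estimated distances from the shared leaf X
    differ by at least delta_min."""
    d_xp = (d_xy + d_xz - d_yz) / 2   # Equation (2.7) for triplet XYZ
    d_xq = (d_xu + d_xv - d_uv) / 2   # Equation (2.7) for triplet XUV
    return abs(d_xp - d_xq) >= delta_min

# Hypothetical estimates: the two centers sit at distances 0.25 and 0.75 from X.
far_apart = centers_separate(0.75, 1.0, 1.25, 1.5, 2.0, 2.0, 0.25)
same_node = centers_separate(0.75, 1.0, 1.25, 0.5, 1.0, 1.0, 0.25)
```

In the second call both centers are estimated at distance 0.25 from $X$, so they are treated as the same node.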

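We next define and analyze two key events $\mathcal{E}_c$ and $\mathcal{E}_g$ as follows. The subscripts
c and g denote the words center and greedy, respectively.

• $\mathcal{E}_c$ is the event that for every triplet $XYZ$ that is not small,
$|\hat{\Delta}_{XP} - \Delta_{XP}| < \frac{\Delta_{\min}}{2}$, $|\hat{\Delta}_{YP} - \Delta_{YP}| < \frac{\Delta_{\min}}{2}$, and $|\hat{\Delta}_{ZP} - \Delta_{ZP}| < \frac{\Delta_{\min}}{2}$, where $P$ is the
center of $XYZ$.

• $\mathcal{E}_g$ is the event that $\hat{\sigma}_{XYZ} > \hat{\sigma}_{X'Y'Z'}$ for every large triplet $XYZ$ and every
small triplet $X'Y'Z'$.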
Lemma 2.9. 



Pr{f c }<2lQ)ex P (-| I fo i y 2 ); Pr{?g}<Q)«-M» 



1 



36a 2 



-lot 



Proof. The inequalities follow from Equation (2.11) and Lemma 2.7, respectively. □

Algorithm Fast Harmonic Greedy Triplets
Input:
  • $\Delta_{\min}$;
  • $\hat{\Delta}_{XY}$ for all leaves $X$ and $Y$ of $T$ which are computed via Equations (2.2)
    and (2.6) from $n$ mutated length-$\ell$ sequences generated by $T$.
Output: $\Psi_w(T)$.

F1  Select an arbitrary leaf $A$ and find a triplet $ABC$ with the maximum $\hat{\sigma}_{ABC}$.
F2  if $ABC$ is not positive then let $T^*$ be the empty tree, fail, and stop.
F3  Let $T^*$ be the star with three edges formed by $ABC$ and its center $D$.
F4  Use Equation (2.7) to set $\Delta^*_{AD} \leftarrow \hat{\Delta}_{AD}$, $\Delta^*_{BD} \leftarrow \hat{\Delta}_{BD}$, $\Delta^*_{CD} \leftarrow \hat{\Delta}_{CD}$.
F5  Set def($D$) $\leftarrow \{A, B, C\}$.
F6  First set all $S[M]$ to null; then for each $Q_1Q_2 \in \{AD, BD, CD\}$, Update-S($Q_1Q_2$).
F7  repeat
F8    if $S[M]$ = null for all leaves $M \in T$ then fail and stop.
F9    Find $S[N] = (P_1P_2, NXY, P, \Delta^*_{P_1P}, \Delta^*_{P_2P}, \Delta^*_{NP})$ with the maximum $\hat{\sigma}_{NXY}$.
F10   Split $P_1P_2$ into two edges $P_1P$ and $P_2P$ in $T^*$ with lengths $\Delta^*_{P_1P}$ and $\Delta^*_{P_2P}$.
F11   Add to $T^*$ a leaf $N$ and an edge $NP$ with length $\Delta^*_{NP}$.
F12   Set def($P$) $\leftarrow \{N, X, Y\}$.
F13   For every $M$ with $S[M]$ containing the edge $P_1P_2$, set $S[M] \leftarrow$ null.
F14   For each $Q_1Q_2 \in \{P_1P, P_2P, NP\}$, Update-S($Q_1Q_2$).
F15 until all leaves of $T$ are inserted to $T^*$; i.e., this loop has iterated $n - 3$ times.
F16 Output $T^*$.

Fig. 3.1. The Fast-HGT algorithm.

Algorithm Update-S
Input: an edge $Q_1Q_2 \in T^*$
U1  Find all splitting tuples for $Q_1Q_2 \in T^*$.
U2  For each $(Q_1Q_2, MUV, Q, \Delta^*_{Q_1Q}, \Delta^*_{Q_2Q}, \Delta^*_{MQ})$ at line U1, assign it to $S[M]$ if
    $\hat{\sigma}_{MUV}$ is greater than that of $S[M]$.

Fig. 3.2. The Update-S subroutine.



3. Fast-HGT. Section 3.1 details Fast-HGT. Section 3.2 analyzes its running
time and work space. Section 3.3 proves technical lemmas for bounding the algorithm's
required sample size. Section 3.4 analyzes this sample size.

3.1. The description of Fast-HGT. Fast-HGT and its subroutines Update-S
and Split-Edge are detailed in Figures 3.1, 3.2, and 3.3, respectively.

Given $\Delta_{\min}$ and $n$ mutated sequences as input, the task of Fast-HGT is to recover
$\Psi_w(T)$. The algorithm first constructs a star $T^*$ formed by a large triplet at lines F1
through F3. It then inserts into $T^*$ a leaf of $T$ and a corresponding internal node per
iteration of the repeat at line F7 until $T^*$ has a leaf for each input sequence. The
$T^*$ at line F16 is our reconstruction of $\Psi_w(T)$. For $k = 3, \ldots, n$, let $T^*_k$ be the version
of $T^*$ with $k$ leaves constructed during a run of Fast-HGT; i.e., $T^*_3$ is constructed at
line F3, and $T^*_k$ with $k \ge 4$ is constructed at line F11 during the $(k - 3)$-th iteration
of the repeat. Note that $T^*_n$ is output at line F16.
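
The initial greedy step (lines F1 and F2) can be sketched as a scan over the pairs $\{B, C\}$ for a fixed leaf $A$, which touches $O(n^2)$ triplets. The sketch below is not the full algorithm: the function name `best_initial_triplet`, the callable `sigma_hat`, and the toy closeness table are assumptions, and non-positive triplets are simply skipped, so that `None` plays the role of the failure at line F2.

```python
from itertools import combinations

def best_initial_triplet(a, leaves, sigma_hat):
    """For a fixed leaf `a`, return (closeness, b, c) maximizing the estimated
    harmonic closeness of triplet a-b-c, or None if no positive triplet exists."""
    best = None
    others = [x for x in leaves if x != a]
    for b, c in combinations(others, 2):
        s = (sigma_hat(a, b), sigma_hat(a, c), sigma_hat(b, c))
        if min(s) <= 0:          # not a positive triplet
            continue
        score = 3 / sum(1 / v for v in s)
        if best is None or score > best[0]:
            best = (score, b, c)
    return best

# Hypothetical estimated closenesses for four leaves.
table = {frozenset("AB"): 0.8, frozenset("AC"): 0.7, frozenset("AD"): 0.2,
         frozenset("BC"): 0.6, frozenset("BD"): 0.3, frozenset("CD"): 0.25}
lookup = lambda x, y: table[frozenset((x, y))]
result = best_initial_triplet("A", "ABCD", lookup)
```

In this toy table the triplet $ABC$ wins because all three of its pairwise closenesses are large, while any triplet containing $D$ is pulled down by a small term in the harmonic mean.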

A node $Q$ is strictly between nodes $Q_1$ and $Q_2$ in $T$ if $Q$ is on the path between
$Q_1$ and $Q_2$ in $T$ but $Q \ne Q_1$, $Q \ne Q_2$, and $Q$ is not the root of $T$. At each iteration of
the repeat, Fast-HGT finds an edge $P_1P_2$ in $T^*$ and a triplet $NXY$ where $X, Y \in T^*$,
$N \notin T^*$, and the center $P$ of $NXY$ is strictly between $P_1$ and $P_2$ in $T^*$. Such $P_1P_2$
and $NXY$ can be used to insert $N$ and $P$ into $T^*$. We record an insertion by letting
def($P$) = $\{N, X, Y\}$; for notational uniformity, let def($X$) = $\{X\}$ for all leaves $X$.

Algorithm Split-Edge
Input: an edge $P_1P_2$ in $T^*$ and a relevant triplet $NXY$ with center $P$.
Output: If $P$ is strictly between $P_1$ and $P_2$ in $T$ and thus can be inserted on
$P_1P_2$, then we return the message "split" and the edge lengths $\Delta^*_{P_1P}$, $\Delta^*_{P_2P}$, and
$\Delta^*_{NP}$. Otherwise, we return a reason why $P$ cannot be inserted.

S1  Use Equation (2.7) to compute $\hat{\Delta}_{XP}$, $\hat{\Delta}_{YP}$, $\hat{\Delta}_{NP}$ for $NXY$.
S2  Let $X_1 \in \{X, Y\} \cap$ def($P_1$) and $X_2 \in \{X, Y\} \cap$ def($P_2$).
S3  For each $i$ = 1 or 2, if $P_i$ is an internal node of $T^*$
S4    then use Equation (2.7) to compute $\hat{\Delta}_{X_iP_i}$ for the triplet formed by def($P_i$)
S5    else set $\hat{\Delta}_{X_iP_i} \leftarrow 0$.
S6  Set $\Delta_1 \leftarrow \hat{\Delta}_{X_1P} - \hat{\Delta}_{X_1P_1}$ and $\Delta_2 \leftarrow \hat{\Delta}_{X_2P} - \hat{\Delta}_{X_2P_2}$.
S7  if $|\Delta_1| < \Delta_{\min}$ or $|\Delta_2| < \Delta_{\min}$
S8    then return "too close"
S9    else begin
S10     if $P_2$ (respectively, $P_1$) is on the path between $P_1$ and $X_1$ ($P_2$ and $X_2$) in $T^*$
S11       then set $\Delta'_1 \leftarrow -\Delta_1$ ($\Delta'_2 \leftarrow -\Delta_2$)
S12       else set $\Delta'_1 \leftarrow \Delta_1$ ($\Delta'_2 \leftarrow \Delta_2$).
        (Remark. Since $X_1$ may equal $X_2$, the tests for $P_1$ and $P_2$ are both needed.)
S13     Set $\Delta''_1 \leftarrow (\Delta'_1 + \Delta^*_{P_1P_2} - \Delta'_2)/2$ and $\Delta''_2 \leftarrow (\Delta'_2 + \Delta^*_{P_1P_2} - \Delta'_1)/2$.
        (Remark. $\Delta''_1 + \Delta''_2 = \Delta^*_{P_1P_2}$, $\Delta''_1$ estimates $\Delta_{P_1P}$, and $\Delta''_2$ estimates $\Delta_{P_2P}$.)
S14     if $\Delta''_1 \ge \Delta^*_{P_1P_2}$ or $\Delta''_2 \ge \Delta^*_{P_1P_2}$
S15       then return "outside this edge"
S16       else return "split", $\Delta''_1$, $\Delta''_2$, $\hat{\Delta}_{NP}$.
S17   end.

Fig. 3.3. The Split-Edge subroutine.

At line F6, $S$ is an array indexed by the leaves $M$ of $T$. At the beginning of each
iteration of the repeat, $S[N]$ stores the most suitable $P_1P_2$ and $NXY$ for inserting $N$
into $T^*$. $S$ is initialized at line F6; it is updated at lines F13 and F14 after a new leaf
and a new internal node are inserted into $T^*$. The precise content of $S$ is described
in Lemma 3.6.

To further specify $S[N]$, we call $NXY$ relevant for $P_1P_2 \in T^*_k$ if it is positive,
$N \notin T^*_k$, $X \in$ def($P_1$), $Y \in$ def($P_2$), and $P_1P_2$ is on the path between $X$ and $Y$ in $T^*_k$.
We use Split-Edge to determine whether the center $P$ of a relevant $NXY$ is strictly
between $P_1$ and $P_2$ in $T$. We also use Split-Edge to calculate an estimation $\Delta^*_{P'P''}$ of
$\Delta_{P'P''}$ for each edge $P'P'' \in T^*$, which is called the length of $P'P''$ in $T^*$. Split-Edge
has three possible outcomes:

1. At line S8, $P$ is too close to $P_1$ or $P_2$ to be a different internal node.

2. At line S15, $P$ is outside the path between $P_1$ and $P_2$ in $T$ and thus should
not be inserted into $T^*_k$ on $P_1P_2$.

3. At line S16, $P$ is strictly between $P_1$ and $P_2$ in $T$. Thus, $P$ can be inserted
between $P_1$ and $P_2$ in $T^*_k$, and the lengths $\Delta^*_{P_1P}$, $\Delta^*_{P_2P}$, $\Delta^*_{NP}$ of the possible
new edges $P_1P$, $P_2P$, and $NP$ are returned.

In the case of the third outcome, $NXY$ is called a splitting triplet for $P_1P_2$ in $T^*_k$, and
$(P_1P_2, NXY, P, \Delta^*_{P_1P}, \Delta^*_{P_2P}, \Delta^*_{NP})$ is a splitting tuple. Each $S[N]$ is either a single
splitting tuple or null. In the latter case, the estimated closeness of the triplet in $S[N]$
is regarded as 0 for technical uniformity.
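
The three outcomes correspond to lines S7 through S16 of Split-Edge, whose arithmetic can be sketched on its own. The sketch assumes that the differences $\Delta_1$, $\Delta_2$ of line S6, the current edge length, and the path tests of line S10 (passed in as the booleans `flip1` and `flip2`) have already been computed; the function name, parameter names, and numbers are illustrative only.

```python
def split_edge_decision(delta1, delta2, edge_len, delta_min, flip1, flip2, d_np):
    """Decision logic of Split-Edge after line S6.
    flip1 / flip2 say whether the sign of delta1 / delta2 must be reversed
    (the path test of line S10); d_np is the estimated length of edge NP."""
    if abs(delta1) < delta_min or abs(delta2) < delta_min:      # S7, S8
        return ("too close",)
    d1 = -delta1 if flip1 else delta1                           # S10-S12
    d2 = -delta2 if flip2 else delta2
    s1 = (d1 + edge_len - d2) / 2                               # S13
    s2 = (d2 + edge_len - d1) / 2
    if s1 >= edge_len or s2 >= edge_len:                        # S14, S15
        return ("outside this edge",)
    return ("split", s1, s2, d_np)                              # S16

# Hypothetical inputs: an edge of length 1.0 and a threshold of 0.125.
inside = split_edge_decision(0.25, 0.75, 1.0, 0.125, False, False, 0.5)
close = split_edge_decision(0.0625, 0.75, 1.0, 0.125, False, False, 0.5)
outside = split_edge_decision(2.0, 1.0, 1.0, 0.125, False, False, 0.5)
```

Averaging in line S13 forces the two new lengths to sum to the old edge length, so the estimate of the existing edge is preserved when it is split.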

Fast-HGT ensures the accuracy of $T^*$ in several ways. The algorithm uses only
positive triplets to recover internal nodes of $T$ at lines F1 and F9. These two lines
together form the greedy strategy of Fast-HGT. The maximality of the triplet chosen
at these two lines favors large triplets over small ones based on Lemmas 2.6 and 2.7.
With a relevant triplet as input, Split-Edge compares $P$ to $P_1$ and $P_2$ using the rule
of Equation (2.10) and can estimate the distance between $P$ and $P_1$ or $P_2$ from the
same leaf to avoid accumulating estimation errors in edge lengths.

The next lemma enables Fast-HGT to grow $T^*$ by always using relevant triplets.

Lemma 3.1. For each $k = 3, \ldots, n - 1$, at the start of the $(k - 2)$-th iteration of
the repeat at line F7, def($P_1$) $\cap$ def($P_2$) $\ne \emptyset$ for every edge $P_1P_2 \in T^*_k$.

Proof. The proof is by induction on $k$. The base case follows from the fact that
the statement holds for $T^*_3$ at line F3. The induction step follows from the use of a
relevant triplet at line F9. □



Remark. Subsequent work [6] shows that Fast-HGT can run with the same
time, space, and sample complexities without knowing $f$ and $\Delta_{\min}$; this is achieved
by slightly modifying some parts of Split-Edge.

3.2. The running time and work space of Fast-HGT. Before proving the
desired time and space complexities of Fast-HGT in Theorem 3.2 below, we note the
following three key techniques used by Fast-HGT to save time and space.

1. At line F1, $ABC$ is selected for a fixed arbitrary $A$. This limits the number
of triplets considered at line F1 to $O(n^2)$. This technique is supported by
the fact that each leaf in $T$ is contained in a large triplet.

2. At lines F6 and F14, $S$ keeps only splitting tuples. This limits the number of
triplets considered for each involved edge to $O(n)$. This technique is feasible
since by Lemma 3.5, $\Psi(T)$ can be recovered using only relevant triplets.

3. At line F14, $S$ includes no new splitting tuples for the edges $Q_1Q_2$ that
already exist in $T^*$ before $N$ is inserted. This technique is feasible because
the insertion of $N$ results in no new relevant triplets for such $Q_1Q_2$ at all.

Theorem 3.2. Fast-HGT runs in $O(n^2)$ time using $O(n)$ work space.
Proof. We analyze the time and space complexities separately as follows.
Time complexity. Line F1 takes $O(n^2)$ time. Line F6 takes $O(n)$ total time to
examine $2(n - 3)$ triplets for each $Q_1Q_2$. As for the repeat at line F7, lines F8, F9,
and F13 take $O(n)$ time to search through $S$. For the $(k - 3)$-th iteration of the
repeat where $k = 4, \ldots, n - 1$, line F14 takes $O(n)$ total time to examine at most
$9(n - k - 1)$ triplets for each of $P_1P$, $P_2P$, and $NP$. Thus, each iteration of the repeat
takes $O(n)$ time. Since the repeat iterates at most $n - 3$ times, the time complexity
of Fast-HGT is as stated.

Space complexity. $T^*$ and the sets def($G$) for all nodes $G$ in $T^*$ take $O(n)$ work
space. $S$ takes $O(n)$ space. Lines F1, F6, and F14 in Fast-HGT and lines U1 and U2
in Update-S can be implemented to use $O(1)$ space. The other variables needed by
Fast-HGT take $O(1)$ space. Thus, the space complexity of Fast-HGT is as stated. □

3.3. Technical lemmas for bounding the sample size. Let $L_k$ be the set of
the leaves of $\Psi(T)$ that are in $T^*_k$. Let $\Psi_k$ be the subtree of $\Psi(T)$ formed by the edges
on paths between leaves in $L_k$. A branchless path in $\Psi_k$ is one whose internal nodes
are all of degree 2 in $\Psi_k$. We say that $T^*_k$ matches $T$ if $T^*_k$ without the edge lengths
can be obtained from $\Psi_k$ by replacing every maximal branchless path with an edge
between its two endpoints.
For $k = 3, \ldots, n$, we define the following conditions:

• $A_k$: $T^*_k$ matches $T$.

• $B_k$: For every internal node $Q \in T^*_k$, the triplet formed by def($Q$) is not small.

• $C_k$: For every edge $Q_1Q_2 \in T^*_k$, $|\Delta^*_{Q_1Q_2} - \Delta_{Q_1Q_2}| < 2\Delta_{\min}$.

In this section, Lemmas 3.3, 3.4, and 3.5 analyze under what conditions Split-Edge
can help correctly insert a new leaf and a new internal node to $T^*_k$. Later in
§3.4, we use these lemmas to show by induction in Lemma 3.7 that the events $\mathcal{E}_g$
and $\mathcal{E}_c$, which are defined before Lemma 2.9, imply that $A_k$, $B_k$, and $C_k$ hold for all $k$.
This leads to Theorem 3.8, stating that Fast-HGT solves the weighted evolutionary
topology problem with a polynomial-sized sample.

Lemmas 3.3, 3.4, and 3.5 make the following assumptions for some $k < n$:

• The $(k - 3)$-th iteration of the repeat at line F7 has been completed.

• $T^*_k$ has been constructed, and $A_k$, $B_k$, and $C_k$ hold.

• Fast-HGT is currently in the $(k - 2)$-th iteration of the repeat.

Lemma 3.3. Assume that $\mathcal{E}_c$ holds and the triplet $NXY$ input to Split-Edge is
not small. Then, the test of line S7 fails if and only if $P \ne P_1$ and $P \ne P_2$ in $T$.
Proof. There are two directions, both using the following equation. By line S6,

(3.1)   $\Delta_1 = (\hat{\Delta}_{X_1P} - \Delta_{X_1P}) - (\hat{\Delta}_{X_1P_1} - \Delta_{X_1P_1}) + (\Delta_{X_1P} - \Delta_{X_1P_1})$.

($\Rightarrow$) To prove by contradiction, assume $P = P_1$ or $P = P_2$ in $T$. If $P = P_1$, then
$\Delta_{X_1P} = \Delta_{X_1P_1}$, and by $A_k$, $P_1$ is an internal node in $T^*_k$. By $B_k$, the triplet formed
by def($P_1$) is not small. Thus, by $\mathcal{E}_c$ and Equation (3.1), $|\Delta_1| < \Delta_{\min}$. By symmetry,
if $P = P_2$, then $|\Delta_2| < \Delta_{\min}$. In either case, the test of line S7 passes.

($\Leftarrow$) Since $P \ne P_1$, $\Delta_{X_1P} - \Delta_{X_1P_1} \ge -\ln(1 - \alpha f) \ge 2\Delta_{\min}$. If $P_1$ is a leaf in $T^*_k$,
then by $A_k$, $P_1$ is leaf $X_1$ in $T$, and $\hat{\Delta}_{X_1P_1} = \Delta_{X_1P_1} = 0$. By $\mathcal{E}_c$ and Equation (3.1),
$|\Delta_1| \ge 1.5\Delta_{\min}$. If $P_1$ is an internal node in $T^*_k$, then by $B_k$, $\mathcal{E}_c$, and Equation (3.1),
we have $|\Delta_1| \ge \Delta_{\min}$. In either case, $|\Delta_1| \ge \Delta_{\min}$. By symmetry, since $P \ne P_2$,
$|\Delta_2| \ge \Delta_{\min}$. Thus, the test of line S7 fails. □

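Lemma 3.4. In addition to the assumption in Lemma 3.3, also assume that
$P \ne P_1$ and $P \ne P_2$ in $T$, i.e., the test of line S7 has failed. Then, the test of line S14
fails if and only if $P$ is on the path between $P_1$ and $P_2$ in $T$.
Proof. There are two directions.
($\Leftarrow$) From lines S6, S11, and S12 and Corollary 2.3(2),

$(\Delta'_1 - \Delta'_2) - (\Delta_{P_1P} - \Delta_{P_2P}) = \pm\left((\hat{\Delta}_{X_1P} - \Delta_{X_1P}) - (\hat{\Delta}_{X_1P_1} - \Delta_{X_1P_1})\right) \pm \left((\hat{\Delta}_{X_2P} - \Delta_{X_2P}) - (\hat{\Delta}_{X_2P_2} - \Delta_{X_2P_2})\right)$.

Thus, whether $P_1$ and $P_2$ are leaves or internal nodes in $T^*_k$, by $A_k$, $B_k$, and $\mathcal{E}_c$,
$|(\Delta'_1 - \Delta'_2) - (\Delta_{P_1P} - \Delta_{P_2P})| < 2\Delta_{\min}$. By line S13 and Corollary 2.3(2),

$\Delta''_1 < \frac{2\Delta_{\min} + (\Delta_{P_1P} - \Delta_{P_2P}) + \Delta^*_{P_1P_2}}{2} = \Delta^*_{P_1P_2} + \frac{2(2\Delta_{\min} - \Delta_{P_2P}) + (-2\Delta_{\min} + \Delta_{P_1P_2} - \Delta^*_{P_1P_2})}{2}$.

Then, since $P \ne P_2$ and thus $\Delta_{P_2P} \ge 2\Delta_{\min}$, by $C_k$, we have $\Delta''_1 < \Delta^*_{P_1P_2}$. By
symmetry, $\Delta''_2 < \Delta^*_{P_1P_2}$. Thus, the test of line S14 fails.

($\Rightarrow$) To prove by contradiction, assume that $P$ is not on the path between $P_1$
and $P_2$. By similar arguments, if $\Delta_{P_1P} > \Delta_{P_1P_2}$ (respectively, $\Delta_{P_2P} > \Delta_{P_1P_2}$), then
$\Delta''_1 \ge \Delta^*_{P_1P_2}$ (respectively, $\Delta''_2 \ge \Delta^*_{P_1P_2}$). Thus, the test of line S14 passes. □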
Fig. 3.4. This subgraph of $T$ fixes some notation used in the proof of Case 1 of Lemma 3.5.
The location of $Y_1$ relative to $Y_2$ and $Z_2$ is nonessential; for instance, $Y_1$ can even be the same as
$Y_2$. In $T^*_k$, def($P_1$) = $\{X, Y_1, Z_1\}$ and def($P_2$) = $\{X, Y_2, Z_2\}$. Neither $XY_1Z_1$ nor $XY_2Z_2$ is small,
and $\Delta_{P_2Y_2} \le \Delta_{P_2Z_2}$. We aim to prove that there is a leaf $N \notin T^*_k$ such that $NXY_2$ or $NZ_1Y_2$ is
large and defines a node $P$ strictly between $P_1$ and $P_2$ in $T$.



Lemma 3.5. Assume that $P_1P_2$ is an edge in $T^*_k$ and some node is strictly between
$P_1$ and $P_2$ in $T$. Then there is a large triplet $NQ_1Q_2$ with center $P$ such that $N \notin T^*_k$,
$Q_1 \in$ def($P_1$), $Q_2 \in$ def($P_2$), and $P$ is strictly between $P_1$ and $P_2$ in $T$.

Proof. By Lemma 2.6(2), for every node $P$ strictly between $P_1$ and $P_2$ in $T$, there
exists a leaf $N \notin T^*_k$ with $\sigma_{PN} \ge (1 - \alpha g)^{d+1}$. To choose $P$, there are two cases: (1)
both $P_1$ and $P_2$ are internal nodes in $T^*_k$, and (2) $P_1$ or $P_2$ is a leaf in $T^*_k$.

Case 1. By Lemma 3.1, let def($P_1$) = $\{X, Y_1, Z_1\}$ and def($P_2$) = $\{X, Y_2, Z_2\}$. By
$B_k$, neither $XY_2Z_2$ nor $XY_1Z_1$ is small. To fix the notation for def($P_1$) and def($P_2$)
with respect to their topological layout, we assume without loss of generality that
Figure 3.4 or equivalently the following statements hold:

• In $T^*_k$ and thus in $T$ by $A_k$, $P_2$ is on the paths between $P_1$ and $Y_2$, between $P_1$
and $Z_2$, and between $P_1$ and $Y_1$, respectively.

• Similarly, $P_1$ is on the paths between $P_2$ and $Z_1$ and between $P_2$ and $X$.

• $\Delta_{P_2Y_2} \le \Delta_{P_2Z_2}$.

Both $NXY_2$ and $NZ_1Y_2$ define $P$, and the target triplet is one of these two for some
suitable $P$. To choose $P$, we further divide Case 1 into three subcases.

Case 1a: $\sigma_{XP_2} < \sigma_{Y_2P_2}(1 - \alpha g)$ and $\sigma_{Y_2P_1} < \sigma_{XP_1}(1 - \alpha g)$. The target triplet
is $NXY_2$. Since $\sigma_{XY_2} < \sqrt{\sigma_{XY_2}} < 1$, by Corollary 2.3(3), let $P$ be a node on the path
between $X$ and $Y_2$ in $T$ with $\sqrt{\sigma_{XY_2}(1 - \alpha g)} \le \sigma_{XP} \le \sqrt{\sigma_{XY_2}(1 - \alpha g)^{-1}}$ and thus
by Lemma 2.1 $\sqrt{\sigma_{XY_2}(1 - \alpha g)} \le \sigma_{Y_2P} \le \sqrt{\sigma_{XY_2}(1 - \alpha g)^{-1}}$. By the condition of
Case 1a and Lemma 2.1, $P$ is strictly between $P_1$ and $P_2$ in $T$. Also, by Corollary 2.4,
$\sigma_{XY_2} \ge \frac{2}{3}\sigma_{XY_2Z_2}$. Thus, by Lemma 2.1, since $XY_2Z_2$ is not small,



(3.2) 



0~NXY 2 



<?XY 2 



> 



3 a XY 2 Z 2 



(1 - ag)-*-V* + \a x \ 



> (Tig. 



2 U XY 2 Z 2 



So $NXY_2$ is as desired for Case 1a.

Case 1b: $\sigma_{XP_2} \ge \sigma_{Y_2P_2}(1-\alpha g)$. The target triplet is $NXY_2$. Let $P$ be the first
node after $P_2$ on the path from $P_2$ toward $P_1$ in $T$. Then, $\sigma_{Y_2P} \ge \sigma_{Y_2P_2}(1-\alpha g)$.
By Corollary 2A, a Y p > cxy 2 z 2 (l — ctg) 2 /2>. Next, since o~xy 2 > o~xz 2 and crp 2 y 2 > 
0P 2 Z 2I

axYZ < 3 < 3 < ^ 

2^ + ffr a ft^ftz a " 2(7^^ + ^ 2 p 2 " 2 ( X ~ Q 3) + C 1 ~ «ff) 2 ' 
So o~ XP > o~ xp > o"xy 2 z 2 (1 — ag) 2 . Since <Jxy 2 > fcrfYbZa and XY2Z2 is not small, 

3 



(3.3) ctnxy 2 



GXPGPN &Y 2 P&PN a XY 2 

1 

> — > (Tig. 



So $NXY_2$ is as desired for Case 1b.

Case 1c: $\sigma_{Y_2P_1} \ge \sigma_{XP_1}(1-\alpha g)$. If $\sigma_{Z_1P_1} > \sigma_{XP_1}$, the target triplet is $NZ_1Y_2$;
otherwise, it is $NXY_2$. The two cases are symmetric, and we assume $\sigma_{XP_1} \ge \sigma_{Z_1P_1}$.
Let $P$ be the first node after $P_1$ on the path from $P_1$ toward $P_2$ in $T$. Then, $\sigma_{XP} \ge$



ctxp! (1 - ag). By Corollary a xp > cr^ Pi (1 - ag) 2 > cr X y lZl (1 - ag) 2 /3. Since 
CTxYa > cxz 2 and oy 2 z 2 > 0, 

<r -J— < 3 < 3 4,Px 

™ ^xy 2 ~ ^yIpSx\ - 2(1 -Off)' 

Hence <7y 2 p > 0y- p > 2crss:y 2 ,z 2 (l — ct.9)/3. Then, since neither nor XY1Z1 is 

small and oxy, > ^<txy 2 z 2 , 

3 

(3.4) (TArxy 2 = T ; t ; 1 — 



&XP&PN &Y 2 P<7PN < 7 XY 2 

1 

> ^xAO- *9)- d - 2 + js'xVXO- ~ a 9)- d ^ 2 + Wy 2 z 2 > ^ 



So $NXY_2$ is as desired for Case 1c with $\sigma_{XP_1} \ge \sigma_{Z_1P_1}$.

Case 2. By symmetry, assume that $P_2 = X$ is a leaf in $T_k^*$. Since $k \ge 3$, $P_1$
is an internal node in $T_k^*$. Let def($P_1$) = $\{X, Y, Z\}$. By symmetry, further assume
$\sigma_{YP_1} \ge \sigma_{ZP_1}$. There are two subcases. If $\sigma_{XP_1} < \sigma_{YP_1}(1-\alpha g)$, the proof is similar
to that of Case 1a and the desired $P$ is in the middle of the path between $X$ and $Y$
in $T$. Otherwise, the proof is similar to that of Case 1b and $P$ is the first node after $P_1$
on the path from $P_1$ toward $X$ in $T$. In both cases, the desired triplet is $NXY$. □
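
The bounds (3.2) through (3.4) all expand the value of a triplet through its center. A minimal Python sketch of that expansion follows, assuming the harmonic-average definition $\sigma_{XYZ} = 3/(1/\sigma_{XY} + 1/\sigma_{XZ} + 1/\sigma_{YZ})$ and the multiplicativity of $\sigma$ along a path that the proof attributes to Lemma 2.1. The function names and the numeric values are hypothetical and serve only as illustration.

```python
def triplet_sigma(s_xy: float, s_xz: float, s_yz: float) -> float:
    """Harmonic average of the three pairwise sigma values of a triplet."""
    return 3.0 / (1.0 / s_xy + 1.0 / s_xz + 1.0 / s_yz)


def triplet_sigma_via_center(s_np: float, s_xp: float, s_yp: float) -> float:
    """sigma_{NXY} for a triplet NXY with center P.

    Assumes sigma multiplies along paths, so sigma_{NX} = sigma_{NP} * sigma_{PX},
    sigma_{NY} = sigma_{NP} * sigma_{PY}, and sigma_{XY} = sigma_{XP} * sigma_{PY}.
    """
    return triplet_sigma(s_np * s_xp, s_np * s_yp, s_xp * s_yp)


# Illustrative (made-up) values of sigma from the center P to each leaf.
s_np, s_xp, s_yp = 0.5, 0.9, 0.8
pairwise = (s_np * s_xp, s_np * s_yp, s_xp * s_yp)
value = triplet_sigma_via_center(s_np, s_xp, s_yp)

# A harmonic average of three positive numbers lies between their minimum
# and three times their minimum.
print(min(pairwise) <= value <= 3.0 * min(pairwise))
```

The upper bound checked in the last line is the kind of relation that lets the proof bound a pairwise value from below in terms of the triplet value.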

3.4. The sample size required by Fast-HGT. The next lemma analyzes $S$.
For $k = 3, \ldots, n-1$ and each leaf $M \in T$, let $S_k[M]$ be the version of $S[M]$ at the
start of the $(k-2)$-th iteration of the repeat at line F7.

Lemma 3.6. Assume that for a given $k \le n-1$, $\mathcal{E}_g$, $\mathcal{E}_c$, $\mathcal{A}_{k'}$, $\mathcal{B}_{k'}$, and $\mathcal{C}_{k'}$ hold
for all $k' \le k$.

1. If $S_k[M]$ is not null, then it is a splitting tuple for some edge in $T_k^*$.

2. If an edge $Q_1Q_2 \in T_k^*$ and a triplet $MR_1R_2$ with $M \notin T_k^*$ satisfy Lemma 3.5,
then $S_k[M]$ is a splitting tuple for $Q_1Q_2$ in $T_k^*$ that contains a triplet $MR'_1R'_2$
with $\hat\sigma_{MR'_1R'_2} \ge \hat\sigma_{MR_1R_2}$.

Proof. The two statements are proved as follows.

Statement 1. This statement follows directly from the initialization of $S$ at line
F6, the deletions from $S$ at line F13, and the insertions into $S$ at lines F6 and F14.

Statement 2. The proof is by induction on $k$.

Base case: $k = 3$. By $\mathcal{E}_c$, $\mathcal{A}_3$, $\mathcal{B}_3$, $\mathcal{C}_3$, and Lemmas 3.3 and 3.4, $MR_1R_2$ is a
splitting triplet for $Q_1Q_2$ in $T_3^*$. By the maximization in Update-S at line F6, $S_3[M]$
is a splitting tuple for some edge $Q'_1Q'_2 \in T_3^*$ that contains a triplet $MR'_1R'_2$ with
$\hat\sigma_{MR'_1R'_2} \ge \hat\sigma_{MR_1R_2}$. By $\mathcal{E}_g$, $MR'_1R'_2$ is not small. By Lemmas 3.3 and 3.4, $Q'_1Q'_2$ is
$Q_1Q_2$.

Induction hypothesis: Statement 2 holds for $k < n-1$.

Induction step. We consider how $S_{k+1}$ is obtained from $S_k$ during the $(k-2)$-th
iteration of the repeat at line F7. There are two cases.

Case 1: $Q_1Q_2$ also exists in $T_k^*$. By $\mathcal{A}_k$, $Q_1Q_2$ and $MR_1R_2$ also satisfy Lemmas 3.3
and 3.4 for $T_k^*$. By the induction hypothesis, $S_k[M]$ is a splitting tuple for $Q_1Q_2$ in $T_k^*$
that contains a triplet $MR'_1R'_2$ with $\hat\sigma_{MR'_1R'_2} \ge \hat\sigma_{MR_1R_2}$. Then, since $Q_1Q_2 \ne P_1P_2$
and $M \ne N$ at line F13, $S_k[M]$ is not reset to null. Thus, it can be changed only
through replacement at line F14 by a splitting tuple for some edge $Q'_1Q'_2$ in $T_{k+1}^*$ that
contains a triplet $MR''_1R''_2$ with $\hat\sigma_{MR''_1R''_2} \ge \hat\sigma_{MR'_1R'_2}$. By $\mathcal{E}_g$, $MR''_1R''_2$ is not small.
Thus, by $\mathcal{E}_c$, $\mathcal{A}_{k+1}$, $\mathcal{B}_{k+1}$, $\mathcal{C}_{k+1}$, and Lemmas 3.3 and 3.4, $Q'_1Q'_2$ is $Q_1Q_2$.

Case 2: $Q_1Q_2 \notin T_k^*$. This case is similar to the base case but uses the maximization
in Update-S at line F14. □

Lemma 3.7. $\mathcal{E}_g$ and $\mathcal{E}_c$ imply that $\mathcal{A}_k$, $\mathcal{B}_k$, and $\mathcal{C}_k$ hold for all $k = 3, \ldots, n$.

Proof. The proof is by induction on $k$.

Base case: $k = 3$. By Lemma ^^(||), $\mathcal{E}_c$, and the greedy selection of line F1, line
F3 constructs $T_3^*$ without edge lengths. Then, $\mathcal{A}_3$ holds trivially. $\mathcal{B}_3$ follows from $\mathcal{E}_c$,
$\mathcal{E}_g$, and line F1. $\mathcal{C}_3$ follows from $\mathcal{B}_3$, $\mathcal{E}_c$, and the use of Equation (2.7) at line F4.

Induction hypothesis: $\mathcal{A}_k$, $\mathcal{B}_k$, and $\mathcal{C}_k$ hold for some $k < n$.

Induction step. The induction step is concerned with the $(k-2)$-th iteration of
the repeat at line F7. Right before this iteration, by the induction hypothesis, since
$k < n$, some $N'Q_1Q_2$ satisfies Lemma 3.5. Therefore, during this iteration, by $\mathcal{E}_c$ and
Lemmas 3.3, 3.4, and 3.6, $S$ at line F8 has a splitting tuple for $T_k^*$ that contains a
triplet $NXY$ with $\hat\sigma_{NXY} \ge \hat\sigma_{N'Q_1Q_2}$. Furthermore, line F9 finds such a tuple. By
$\mathcal{E}_g$, $NXY$ is not small. Lines F10 and F11 create $T_{k+1}^*$ using this triplet. Thus, $\mathcal{B}_{k+1}$
follows from $\mathcal{B}_k$. By Lemmas 3.3 and 3.4, $\mathcal{A}_{k+1}$ follows from $\mathcal{A}_k$. $\mathcal{C}_{k+1}$ follows from
$\mathcal{C}_k$ since the triplets involved at line F13 are not small. □

Theorem 3.8. For any $0 < \delta < 1$, using sequence length

\[ \ell = O\!\left( \frac{\log\frac{1}{\delta} + \log n}{(1-\alpha g)^{4d+8} f^2 c^2} \right), \]

Fast-HGT outputs $T^*$ with the properties below with probability at least $1-\delta$:

1. Disregarding the edge lengths, T* = ^ W (T). 

2. For each edge QiQ 2 in T* , |A^ iQ2 - A Qi q 2 | < 2A min . 
Proof. By Lemma 2.7, $\Pr\{\overline{\mathcal{E}_g}\} \le \delta/2$ if



d c f o 3 In n + In 4 
I > L = 210a 2 „ 5 - 



Similarly, by Lemma 2.8, $\Pr\{\overline{\mathcal{E}_c}\} \le \delta/2$ if



def 3 In n + In I 
> L = 81- 
We choose $\ell = \lceil \max\{\ell_g, \ell_c\} \rceil$. Consequently, $\Pr\{\mathcal{E}_g \text{ and } \mathcal{E}_c\} \ge 1-\delta$. By Lemma 3.7,
with probability at least $1-\delta$, Fast-HGT outputs $T^*$, and $\mathcal{A}_n$ and $\mathcal{C}_n$ hold, which
correspond to the two statements of the theorem. □
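
The passage from the two separate conditions on the sequence length to the joint probability is a union bound. Spelled out, and reading the two conditions as bounds of $\delta/2$ on the probabilities that $\mathcal{E}_g$ and $\mathcal{E}_c$ fail (the overline for a complementary event is a notational assumption of this sketch):

```latex
\Pr\{\mathcal{E}_g \text{ and } \mathcal{E}_c\}
  \;\ge\; 1 - \Pr\{\overline{\mathcal{E}_g}\} - \Pr\{\overline{\mathcal{E}_c}\}
  \;\ge\; 1 - \frac{\delta}{2} - \frac{\delta}{2}
  \;=\; 1 - \delta .
```

Taking the larger of the two length thresholds makes both conditions hold at once, which is why the maximum appears in the choice of the sequence length.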

4. Further research. We have shown that theoretically, Fast-HGT has the
optimal time and space complexity as well as a polynomial sample complexity. It would
be important to determine the practical performance of the algorithm by testing it
extensively on empirical and simulated trees and sequences. Furthermore, as
conjectured by one of the referees and some other researchers, there might be a trade-off
between the time complexity and the practical performance. If this is indeed true
empirically, it would be significant to quantify the trade-off analytically.

Acknowledgments. We thank Dana Angluin, Kevin Atteson, Joe Chang,
Junhyong Kim, Stan Eisenstat, Tandy Warnow, and the anonymous referees for extremely
helpful discussions and comments.

Appendix A. Proofs of technical lemmas. 



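A.1. Proof of Lemma 2.5. Let $h_{XY} = \hat\sigma_{XY}/\sigma_{XY}$, $h_{XZ} = \hat\sigma_{XZ}/\sigma_{XZ}$, and $h_{YZ} = \hat\sigma_{YZ}/\sigma_{YZ}$.
By Equations (2.6) and (2.7), and by conditioning on the events $\{h_{XZ} \le 1-r\}$
and $\{h_{YZ} \ge 1+s\}$ for some $r, s > 0$,

\[ \Pr\Bigl\{ \hat\Delta_{XP} - \Delta_{XP} \ge -\frac{\ln(1-\epsilon)}{2} \Bigr\} = \Pr\{ h_{XY}h_{XZ} \le h_{YZ}(1-\epsilon) \} \]
\[ \le \Pr\{ h_{XZ} \le 1-r \} + \Pr\{ h_{YZ} \ge 1+s \} + \Pr\Bigl\{ h_{XY} \le \frac{(1+s)(1-\epsilon)}{1-r} \Bigr\}. \]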

Setting ±=£ > 1 - e, by Equations (gj) and (gj) 



x * -ln(l-e) 

A X p - Aa-p ^ 




~£a xz r 2 J + exp (-^2 itj yzs 2 j + exp ( ~^xy f 1 - C 1 ~ 



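Equating these exponential terms yields equations for $r$ and $s$. The solution for $r$ is

\[ r = \frac{t - \sqrt{t^2-u}}{2\sigma_{XZ}\sigma_{YZ}}; \quad t = \sigma_{XY}\sigma_{YZ} + \sigma_{XZ}\sigma_{YZ} + (1-\epsilon)\sigma_{XY}\sigma_{XZ}; \quad u = 4\sigma_{XY}\sigma_{XZ}\sigma_{YZ}^2\,\epsilon. \]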
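
For readers who want the intermediate step, the quadratic behind this solution can be sketched as follows. The sketch assumes that the three exponents in the preceding bound share a common constant factor, so that equating them amounts to equating $\sigma_{XZ}\,r$, $\sigma_{YZ}\,s$, and $\sigma_{XY}\bigl(1 - \frac{(1+s)(1-\epsilon)}{1-r}\bigr)$:

```latex
\begin{align*}
s &= \frac{\sigma_{XZ}}{\sigma_{YZ}}\, r, \\
\sigma_{XY}\bigl(\epsilon - r - (1-\epsilon)\,s\bigr) &= \sigma_{XZ}\, r\,(1-r), \\
\sigma_{XZ}\sigma_{YZ}\, r^2 - t\, r + \sigma_{XY}\sigma_{YZ}\,\epsilon &= 0,
\qquad t = \sigma_{XY}\sigma_{YZ} + \sigma_{XZ}\sigma_{YZ} + (1-\epsilon)\,\sigma_{XY}\sigma_{XZ}.
\end{align*}
```

The smaller root of the last equation has discriminant $t^2 - 4\sigma_{XY}\sigma_{XZ}\sigma_{YZ}^2\,\epsilon$, which is the quantity $t^2 - u$ in the solution for $r$.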



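Using Taylor's expansion, for $u > 0$, $(t - \sqrt{t^2-u})^2 \ge \frac{u^2}{4t^2}$. Thus,

\[ r^2 \ge \frac{\epsilon^2\sigma_{XY}^2\sigma_{YZ}^2}{t^2} \ge \frac{\epsilon^2}{\left(\frac{1}{\sigma_{XZ}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{XY}}\right)^2 \sigma_{XZ}^2} = \frac{\epsilon^2\sigma_{XYZ}^2}{9\,\sigma_{XZ}^2}. \]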



So Pr{ Aap - Aap > ^1^1 } < 3 exp (-^£a 2 xz r 2 ) < 3 exp | 



A.2. Proof of Lemma 2.7. We use the following basic inequalities.

(A.1) $\min\left\{ \frac{\hat\sigma_{XY}}{\sigma_{XY}}, \frac{\hat\sigma_{XZ}}{\sigma_{XZ}}, \frac{\hat\sigma_{YZ}}{\sigma_{YZ}} \right\} \le \frac{\hat\sigma_{XYZ}}{\sigma_{XYZ}} \le \max\left\{ \frac{\hat\sigma_{XY}}{\sigma_{XY}}, \frac{\hat\sigma_{XZ}}{\sigma_{XZ}}, \frac{\hat\sigma_{YZ}}{\sigma_{YZ}} \right\}$

(A.2) $\sigma_{XYZ} \le 3 \min\{ \sigma_{XY}, \sigma_{XZ}, \sigma_{YZ} \}$

The proof of Equation (2.9) is symmetric to that of Equation (2.8). So we only
prove the latter. Pick $A > 1$ with $\sigma_{XYZ} = \sigma_{lg}A$. Without loss of generality, we assume

min(^, 

y a XY &XZ &YZ ) 



By Equations Q, Q, and Q , 



Pr{ ffxyz < cr rod } = Pr 
2 



crxrz crigA 



< Pr 



< exp 



^ 1 



(Tl g A 



'XY 



< exp 



V 



2 ( 1 _ £md 

9a 2 



(^XY 0"md 

UXY ~~ CTlgA 
2 



Then, Equation (2.5) follows from the fact that by the choice of $\sigma_{md}$,



9a 2 



36a 2 



A.3. Proof of Lemma 2.8. Since Lemma 2.5 can help establish only one half of
the desired inequality, we split the probability on the left-hand side of Equation (2.11).



Prj Axp-Axp > 
< Prj A XP - A XP > 
Pr| A X p - A X p < 



A, 



2 

^min 

6 



Pr<^ A Y p - A Y p > 



A,. 



Ayp — Ayp < 



G 



Then, since $\hat\Delta_{XY} - \Delta_{XY} = (\hat\Delta_{XP} - \Delta_{XP}) + (\hat\Delta_{YP} - \Delta_{YP})$, we have



Pr<^ A X p - A XP < 



A,. 



Ayp — Ayp < 



< Pr<! A X y - A X y < - 



A,. 



Consequently, 
(A.3) Pr 



\xp — l±XP 



> 



< Pr< A 



\xp — "XP 



> 



Prj Ayp - Ayp > 
Prj Axy - Axy < 



A n 



G 

A, 



+ 



By Lemma 2.5 



e 3 



Prj Axp - A X p > ^ } < 3 exp f--^£a XYZ (l 



By Taylor's expansion, ( 1 — e ™ m ) > (l — (1 — af)^ 2 > ^-a 2 / 2 , and thus 



(A.4) 



Pr<^ Axp - A X p > 
By symmetry,



(A.5) 



By Equation 



Pr<^ Ay P - A Y p > 



< 3exp 



81 



0, Pr{ 



< 



< exp 



■ £4£y ( 



e 3 



From Equation ( |A.2| ), oxy > - — • By Taylor's expansion, ( e~s~ — lj 



> 



((1 - af)-i - 1) > fa 2 / 2 - Therefore 
(A.6) 



*XY 



> 



Lemma 2.8 follows from the fact that putting Equations (A.3) through (A.6) together,
we have Pr|